New AI Tool Searches Millions of Historical Newspaper Pages
2020-10-01
A new search tool uses machine learning to search millions of U.S. newspaper pages for historical pictures.
The U.S. Library of Congress recently launched the tool, called Newspaper Navigator. The online search system is available for free to the public.
The Library of Congress is the world's largest library. It offers materials from the creative record of the United States. The library serves as the main research service for the U.S. Congress.
Newspaper Navigator currently permits users to search more than 16 million pages from newspapers across the country, from 1900 to 1963.
The newspaper pages were digitized for another Library of Congress project, called Chronicling America. This tool also permits searches across the library's 16 million newspaper pages. The pages contain more than 1.5 million images.
The Chronicling America system permits users to find and look at full newspaper pages as digitized images. Users can also search the collection by keyword, using optical character recognition (OCR). OCR is a technology that identifies printed characters in a scanned page image so that the text can be searched or extracted.
Because OCR reads only text, people using the Chronicling America site had to look through newspaper pages themselves when trying to find specific images. The new Newspaper Navigator tool offers the ability to search the collection's visual content directly.
This is where the machine-learning methods come in. The search system was trained to recognize different kinds of images. For example, it was designed to tell the difference between photos, maps, comics and advertisements. It can also identify similar images and return these in search results.
Benjamin Lee created the system. He is a member of the Library of Congress' Innovator in Residence Program. The program was established to sponsor people from different fields to create new ways to present the library's huge historical collections to the public.
Lee trained a machine-learning model to identify the visual content and then ran the model over all 16 million pages in Chronicling America.
His training model was based on another Library of Congress experiment called Beyond Words. That project invited members of the public to help identify cartoons, drawings, pictures and advertisements in World War I-era newspapers.
Lee said that after he learned of the Beyond Words experiment, he saw a great possibility to use that information to power his machine-learning tool. "I began to wonder whether this identified visual content was the key to throwing open the treasure chest of visual content, throughout all 16 million pages in Chronicling America."
Newspaper Navigator works like other search engines. Users enter a search term in the "keyword" box. They can also choose to limit search results by location, as well as by date.
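A keyword search limited by location and date can be pictured with a short Python sketch. This is only an illustration, not the library's code. The `Page` record, its field names and the `search` function are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Page:
    """Hypothetical record for one digitized newspaper page."""
    ocr_text: str     # text that OCR extracted from the page
    state: str        # where the newspaper was published
    published: date   # date of publication


def search(pages: List[Page], keyword: str,
           state: Optional[str] = None,
           start: Optional[date] = None,
           end: Optional[date] = None) -> List[Page]:
    """Return pages whose OCR text contains the keyword.

    The optional state, start and end arguments narrow the results.
    """
    needle = keyword.lower()
    results = []
    for page in pages:
        if needle not in page.ocr_text.lower():
            continue  # keyword not found in the page text
        if state is not None and page.state != state:
            continue  # wrong location
        if start is not None and page.published < start:
            continue  # published too early
        if end is not None and page.published > end:
            continue  # published too late
        results.append(page)
    return results
```

Calling `search(pages, "map", state="Ohio")` would keep only pages from Ohio whose text mentions a map. A real system would use a search index instead of checking every page one by one.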
But one of the most powerful tools in the system is the ability to search images by visual similarity. Users of the tool can save images to a personal "collection." They can then use those images as a basis for finding other visually similar images across the library's full collection.
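One common way to compare images is to turn each one into a list of numbers, called a feature vector, and then measure how closely two vectors point in the same direction. The Python sketch below shows that idea using cosine similarity. The article does not say how Newspaper Navigator measures similarity, so the method, the function names and the example vectors are all assumptions.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return a score near 1.0 for similar vectors and near 0.0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def most_similar(query: List[float],
                 library: Dict[str, List[float]],
                 top_k: int = 3) -> List[str]:
    """Return the names of the top_k images closest to the query vector."""
    scored = [(cosine_similarity(query, vec), name)
              for name, vec in library.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Made-up feature vectors for four images (hypothetical values).
library = {
    "map_1905": [0.9, 0.1, 0.0],
    "photo_1918": [0.1, 0.9, 0.2],
    "map_1921": [0.8, 0.2, 0.1],
    "comic_1930": [0.0, 0.2, 0.9],
}
```

With these made-up values, a query vector of `[1.0, 0.0, 0.0]` ranks the two maps above the photo and the comic. In practice, the feature vectors would come from a trained model, not from numbers typed by hand.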
The system even permits users to "retrain" the machine learning tool for individual searches. This is done by examining the images that the search returns. By selecting whether images found were similar or not similar to the desired result, the user is "retraining" the system to improve its search performance.
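The general idea of learning from a user's "similar" and "not similar" choices can be sketched in Python. The example uses a classic method called Rocchio relevance feedback, chosen here only for illustration. The article does not describe the method that Newspaper Navigator uses. The function names and the weights `alpha`, `beta` and `gamma` are assumptions.

```python
from typing import List


def mean_vector(vectors: List[List[float]], dim: int) -> List[float]:
    """Average a list of vectors; return all zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def refine_query(query: List[float],
                 similar: List[List[float]],
                 not_similar: List[List[float]],
                 alpha: float = 1.0,
                 beta: float = 0.75,
                 gamma: float = 0.25) -> List[float]:
    """Move the query toward images marked similar and away from the rest.

    alpha weights the original query, beta the "similar" images and
    gamma the "not similar" images.
    """
    dim = len(query)
    positive = mean_vector(similar, dim)
    negative = mean_vector(not_similar, dim)
    return [alpha * q + beta * p - gamma * n
            for q, p, n in zip(query, positive, negative)]
```

The refined vector can then be used for a new similarity search. Each round of feedback moves the results closer to what the user wants.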
A demonstration of the Newspaper Navigator is available to help users learn more about the tool and how to carry out different searches. The creators hope the tool can be useful for historians, reporters, educators, professional researchers or anyone interested in learning about U.S. history through newspapers.
The Library of Congress notes that all images included in Newspaper Navigator and Chronicling America are in the public domain, meaning people are free to use them as they wish.
I'm Bryan Lynn.
Bryan Lynn wrote this story for VOA Learning English, based on reports from the Library of Congress. Ashley Thompson was the editor.
________________________________________________________________
Words in This Story
page - n. one part of a website
digitize - v. to put information into the form of a series of numbers, usually so that it can be understood by a computer
character - n. a letter, number or other mark or sign used in writing or printing
comics - n. a series of pictures that tell a story
content - n. information contained in a piece of writing, a speech, a movie or on the internet
visual - adj. related to seeing
sponsor - v. to pay for someone to do something or for something to happen
location - n. place where something takes place